Thanks a lot. It's really nice to see so many mathematicians stepping into the area of trying to understand the mathematical foundations of deep neural networks. I've been working in this area since 2018, so it's been seven years, and it's really nice to see how this field evolves and brings new excitement. First of all, I want to show you that we are now using models of ever-increasing size. You can see GPT-2, GPT-3, and GPT-4; for GPT-4 we don't really know the size, it's just a guess. There is also DeepSeek: DeepSeek-V3 has about 600 billion parameters, and the leak is that the new one is even larger.
So why are we using such very large systems to learn? We can observe the same in the evolution of mammals: from the mouse brain to the human brain, the size increased a lot. And that leads to a question. With a very, very large model with so many parameters, do we suffer from overfitting? Since we have many more degrees of freedom than a mouse, should we worry about that? Traditional wisdom tells us that if we have some data, maybe not enough, certainly not infinitely many samples, we should probably start with a small model, meaning a model with fewer parameters. If we blindly use a large model, we can easily suffer from the problem of overfitting: even though we obtain a very good training error, that by itself does not give us generalization capability.
This traditional wisdom comes with lots of justification, for example from numerical analysis or from statistical learning theory; there are many different justifications for the claim that complex models easily overfit. There is even a philosophical justification, Occam's razor, telling you that you shouldn't consider a very large model. However, as we can see from the success of, for example, large neural network models, the scaling law tells you that even if you have very limited data, using large models still brings benefits. That is really weird, and it is therefore sometimes called an apparent paradox, although it goes by other names as well. It is the thing we must explain in order to have a mathematical understanding. So when did we first realize this problem?
It goes back at least to Leo Breiman, the very famous statistician. In 1995 he posed four problems that are still not well solved even now. In particular, in the first one he asks: if we have a very heavily parameterized neural network, how do we understand its non-overfitting behavior? We did make some progress, and some of you probably know the Neural Tangent Kernel theory. However, the Neural Tangent Kernel theory actually tells us that a neural network sometimes resembles a kernel method; in which sense the neural network model is superior to kernel methods, we don't really understand. Therefore, the whole phenomenon I'm going to tell you about concerns this kind of nonlinear behavior.
I'm trying to understand how these overparameterized neural networks control the complexity of the output function during the nonlinear training process. You can see that if we let the network increase its complexity arbitrarily fast, then overfitting is unavoidable; it seems that there is some kind of nonlinear behavior that helps generalization. Okay, now I will tell you what the condensation phenomenon is.
First of all, let's go to a very intuitive illustration. We have a one-hidden-layer neural network with five neurons, and initially all the weights are initialized randomly from a Gaussian distribution, so they are all different. However, during training you can observe the following behavior, where some of the input weights become equal to one another: for example, neurons one, two, and three become almost identical, and so do neurons four and five. If that happens, we say the network condenses. When the neurons are in a condensed state, you can see that we can combine different neurons into fewer neurons, so the network is equivalent to a smaller one. If that really happens during training, kind of automatically, then it must be a mechanism to control the complexity during training.
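To make the merging step concrete, here is a minimal sketch of my own (not from the talk) checking that two tanh neurons sharing the same input weight and bias act exactly like a single neuron whose output weight is the sum of theirs:

```python
import numpy as np

# One-hidden-layer network: f(x) = sum_j a_j * tanh(w_j * x + b_j).
# If two neurons share the same (w, b), they can be merged into one
# neuron with output weight a1 + a2, i.e., a smaller network.

rng = np.random.default_rng(0)
x = rng.normal(size=100)            # 1-D inputs

w, b = 0.7, -0.2                    # shared input weight and bias of the condensed pair
a1, a2 = 1.3, -0.4                  # output weights of the two condensed neurons

two_neurons = a1 * np.tanh(w * x + b) + a2 * np.tanh(w * x + b)
one_neuron = (a1 + a2) * np.tanh(w * x + b)

print(np.max(np.abs(two_neurons - one_neuron)))  # 0.0: the two networks coincide
```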
Okay, now let's look at the real phenomenon. Here we have a one-hidden-layer neural network with hundreds of neurons, and the fitting is just a one-dimensional fitting problem; the blue dots are the training points. Each red dot represents a neuron: the x-axis is its input weight w_j and the y-axis is its bias b_j, so each neuron's orientation in the (w_j, b_j) plane represents its feature.
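As a rough way to reproduce this kind of picture, here is a hedged sketch, assuming PyTorch, with all hyperparameters being my own choices rather than the talk's: it trains a small one-hidden-layer tanh network on a 1-D regression task from a small Gaussian initialization and prints the (w_j, b_j) pairs, which one can scatter-plot to check whether they cluster:

```python
import torch

torch.manual_seed(0)

# 1-D training data (the analogue of the blue dots in the talk's figure)
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = torch.sin(3 * x).squeeze()

m = 200                                            # hundreds of hidden neurons
w = (0.1 * torch.randn(m, 1)).requires_grad_()     # input weights w_j (small init)
b = (0.1 * torch.randn(m)).requires_grad_()        # biases b_j
a = (0.1 * torch.randn(m)).requires_grad_()        # output weights a_j

def f(x):
    # f(x) = sum_j a_j * tanh(w_j * x + b_j)
    return torch.tanh(x @ w.T + b) @ a

opt = torch.optim.Adam([w, b, a], lr=1e-3)
for step in range(20000):
    opt.zero_grad()
    loss = ((f(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

# Each (w_j, b_j) pair is one red dot; condensation shows up as a few clusters.
for wj, bj in zip(w.detach().squeeze().tolist(), b.detach().tolist()):
    print(f"w_j = {wj:+.3f}, b_j = {bj:+.3f}")
```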